-
Notifications
You must be signed in to change notification settings - Fork 0
[RDST-111] keep tensorboard directory clean during s3 sync #3
Conversation
@@ -38,6 +38,7 @@ def __init__(self, estimator, logdir=None): | |||
threading.Thread.__init__(self) | |||
self.event = threading.Event() | |||
self.estimator = estimator | |||
self.aws_sync_dir = tempfile.mkdtemp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut says this should be self._aws_sync_dir
, but it's not clear to me why some fields are considered public while others aren't, so it probably doesn't matter.
"""Sync to_directory with from_directory by copying each file in | ||
to_directory with new contents. Why do this? Because TensorBoard picks | ||
up temp files from `aws s3 sync` and then stops reading the correct | ||
tfevent files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a relevant TensorBoard issue we could include here? I'm just curious if that could be an easier reference for the future in understanding this and knowing if/when it's fixed, versus using git blame
to track down this discussion.
This is the closest one I came across, though it's slightly different: tensorflow/tensorboard#70
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I still think it's probably related to tensorflow/tensorboard#349. I'll put that in the docstring.
from_directory (str): The directory with updated files. | ||
to_directory (str): The directory to be synced. | ||
""" | ||
for root, dirs, files in os.walk(from_directory): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'll want to check to see if to_directory
exists before this loop, then create it if it doesn't. From what I can tell, running aws s3 sync src_dir dst_dir
will create dst_dir
for you, but it looks like we're only attempting to create subdirectories of to_directory
rather than the directory itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a fairly elegant and only Python solution that solves a problem that must be affecting quite a lot of people. We should consider opening a PR
upstream.
This PR is a workaround to make sure TensorBoard picks up logs that get synced in a local directory. The problem seems to be that
aws s3 sync
creates temporary files while it's downloading them. These are lexicographically greater than the files they are updating (because they append.<some hash>
to the end). But TensorBoard stops looking at a file when a new one shows up (explanation here).